
    On Isotropy, Contextualization and Learning Dynamics of Contrastive-based Sentence Representation Learning

    Incorporating contrastive learning objectives in sentence representation learning (SRL) has yielded significant improvements on many sentence-level NLP tasks. However, it is not well understood why contrastive learning works for learning sentence-level semantics. In this paper, we aim to help guide future designs of sentence representation learning methods by taking a closer look at contrastive SRL through the lens of isotropy, contextualization and learning dynamics. We interpret its successes through the geometry of the representation shifts and show that contrastive learning brings isotropy and drives high intra-sentence similarity: tokens in the same sentence converge to similar positions in the semantic space. We also find that what we formalize as "spurious contextualization" is mitigated for semantically meaningful tokens, while augmented for functional ones. We find that the embedding space is directed towards the origin during training, with more areas now better defined. We ablate these findings by observing the learning dynamics with different training temperatures, batch sizes and pooling methods. Comment: Accepted by ACL 2023 (Findings, long paper).
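    For illustration, a minimal sketch of the training signal under study: an in-batch contrastive (InfoNCE) objective with the temperature and pooling knobs the ablations vary, plus the intra-sentence similarity measure the paper reports rising. Names and shapes are illustrative, not the authors' code.

```python
# Sketch of the contrastive-SRL setup the abstract analyses (PyTorch).
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.05):
    """In-batch contrastive loss over pooled sentence embeddings.

    anchor_emb, positive_emb: (batch, dim). Each anchor's positive is
    the row with the same index; other rows act as negatives.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature  # cosine sims / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def intra_sentence_similarity(token_embs):
    """Mean pairwise cosine similarity of the tokens in one sentence,
    the quantity shown to rise under contrastive training."""
    t = F.normalize(token_embs, dim=-1)          # (tokens, dim)
    sim = t @ t.T
    n = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()  # drop self-similarity
    return off_diag / (n * (n - 1))
```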

    ExBERT: An External Knowledge Enhanced BERT for Natural Language Inference

    Neural language representation models such as BERT, pretrained on large-scale unstructured corpora, lack explicit grounding to real-world commonsense knowledge and are often unable to remember facts required for reasoning and inference. Natural Language Inference (NLI) is a challenging reasoning task that relies on common human understanding of language and real-world commonsense knowledge. We introduce a new model for NLI, External Knowledge Enhanced BERT (ExBERT), to enrich the contextual representation with real-world commonsense knowledge from external knowledge sources and enhance BERT's language understanding and reasoning capabilities. ExBERT takes full advantage of contextual word representations obtained from BERT and employs them to retrieve relevant external knowledge from knowledge graphs and to encode the retrieved external knowledge. Our model adaptively incorporates the external knowledge context required for reasoning over the inputs. Extensive experiments on the challenging SciTail and SNLI benchmarks demonstrate the effectiveness of ExBERT: in comparison to the previous state of the art, we obtain an accuracy of 95.9% on SciTail and 91.5% on SNLI.
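    A schematic of the pipeline the abstract describes, with a hypothetical retrieve_facts stand-in for the knowledge-graph lookup; the concatenation-based fusion below is an assumption for illustration, not necessarily ExBERT's exact mechanism.

```python
# Knowledge-enhanced NLI, sketched with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entail / neutral / contradict

def retrieve_facts(premise, hypothesis):
    # Hypothetical placeholder: would query a knowledge graph for
    # triples mentioning terms from the input pair.
    return ["a bicycle is a vehicle"]

def classify(premise, hypothesis):
    facts = " ".join(retrieve_facts(premise, hypothesis))
    # Append retrieved commonsense context to the second segment
    # (one possible fusion strategy, assumed here for illustration).
    inputs = tokenizer(premise, hypothesis + " [SEP] " + facts,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)
```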

    A deep learning approach to fight illicit trafficking of antiquities using artefact instance classification

    We approach the task of detecting the illicit movement of cultural heritage from a machine learning perspective by presenting a framework for detecting a known artefact in a new and unseen image. To this end, we explore the machine learning problem of instance classification for large archaeological image datasets, i.e. where each individual object (instance) is itself a class to which all of the multiple images of that object belong. We focus on a wide variety of objects in the Durham Oriental Museum, with which we build a dataset of over 24,502 images of 4332 unique object instances. We experiment with state-of-the-art convolutional neural network models, the smaller variations of which are suitable for deployment in mobile applications. We find the exact object instance of a given image can be predicted from among 4332 others with ~72% accuracy, showing how effectively machine learning can detect a known object from a new image. We demonstrate that accuracy significantly improves as the number of images per object instance increases (up to ~83%), with an ensemble of classifiers scoring as high as 84%. We find that the correct instance is found in the top 3, 5, or 10 predictions of our best models ~91%, ~93%, or ~95% of the time, respectively. Our findings contribute to the emerging overlap of machine learning and cultural heritage, and highlight the potential available to future applications and research.
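    A minimal sketch of instance classification as framed here: a compact CNN with one output class per object, plus the top-k metric reported. The MobileNetV3 backbone is an assumption in line with the paper's remark about mobile deployment, not necessarily the authors' model.

```python
# One class per unique object instance; top-k retrieval metric.
import torch
import torch.nn as nn
from torchvision import models

NUM_INSTANCES = 4332  # one class per museum object

model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features,
                                 NUM_INSTANCES)

def top_k_accuracy(logits, targets, k=5):
    """Fraction of images whose true instance is among the top-k scores."""
    topk = logits.topk(k, dim=1).indices             # (batch, k)
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```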

    Predicting Current Glycated Hemoglobin Levels in Adults From Electronic Health Records: Validation of Multiple Logistic Regression Algorithm

    Background: Electronic health record (EHR) systems generate large datasets that can significantly enrich the development of medical predictive models. Several attempts have been made to investigate the effect of glycated hemoglobin (HbA1c) elevation on the prediction of diabetes onset. However, there is still a need to validate these models using EHR data collected from different populations. Objective: The aim of this study is to perform a replication study to validate, evaluate, and identify the strengths and weaknesses of replicating a predictive model that employed multiple logistic regression with EHR data to forecast the levels of HbA1c. The original study used data from a population in the United States, and this differentiated replication used a population in Saudi Arabia. Methods: A total of 3 models were developed and compared with the model created in the original study. The models were trained and tested using a larger dataset from Saudi Arabia with 36,378 records. The 10-fold cross-validation approach was used for measuring the performance of the models. Results: Applying the method employed in the original study achieved an accuracy of 74% to 75% when using the dataset collected from Saudi Arabia, compared with 77% obtained from using the population from the United States. The results also show a different ranking of importance for the predictors between the original study and the replication. The order of importance for the predictors with our population, from most to least important, is age, random blood sugar, estimated glomerular filtration rate, total cholesterol, non–high-density lipoprotein, and body mass index. Conclusions: This replication study shows that direct use of the models (calculators) created using multiple logistic regression to predict the level of HbA1c may not be appropriate for all populations. This study reveals that the weighting of the predictors needs to be calibrated to the population used. However, the study does confirm that replicating the original study using a different population can help with predicting the levels of HbA1c by using the predictors that are routinely collected and stored in hospital EHR systems.
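    A sketch of the replicated method under stated assumptions: multiple logistic regression over the listed predictors, scored with 10-fold cross-validation. Column and file names are illustrative stand-ins, not the study's data schema.

```python
# Multiple logistic regression with 10-fold CV (scikit-learn).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predictors listed in the abstract; names here are hypothetical.
predictors = ["age", "random_blood_sugar", "egfr",
              "total_cholesterol", "non_hdl", "bmi"]

df = pd.read_csv("ehr_records.csv")          # hypothetical dataset
X, y = df[predictors], df["hba1c_elevated"]  # binary target

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.2%} +/- {scores.std():.2%}")
```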

    An Exploration of Dropout with RNNs for Natural Language Inference

    Dropout is a crucial regularization technique for the Recurrent Neural Network (RNN) models of Natural Language Inference (NLI). However, the effectiveness of dropout at different layers and with different dropout rates has not been evaluated in NLI models. In this paper, we propose a novel RNN model for NLI and empirically evaluate the effect of applying dropout at different layers in the model. We also investigate the impact of varying dropout rates at these layers. Our empirical evaluation on a large (Stanford Natural Language Inference (SNLI)) and a small (SciTail) dataset suggests that dropout at each feed-forward connection severely affects the model accuracy at increasing dropout rates. We also show that regularizing the embedding layer is efficient for SNLI, whereas regularizing the recurrent layer improves the accuracy for SciTail. Our model achieved an accuracy of 86.14% on the SNLI dataset and 77.05% on SciTail.
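    To make the dropout placements concrete, here is a sketch of an RNN NLI encoder with dropout at the embedding layer, after the recurrent layer, and on the feed-forward connections of the classifier; the architecture details are illustrative, not the paper's exact model.

```python
# Dropout placements in an RNN-based NLI model (PyTorch).
import torch
import torch.nn as nn

class NLIModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=512, num_classes=3,
                 p_emb=0.1, p_rnn=0.2, p_ff=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.emb_drop = nn.Dropout(p_emb)   # embedding-layer dropout
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.rnn_drop = nn.Dropout(p_rnn)   # dropout after the recurrence
        self.classifier = nn.Sequential(    # feed-forward dropout
            nn.Dropout(p_ff), nn.Linear(8 * hidden, hidden), nn.ReLU(),
            nn.Dropout(p_ff), nn.Linear(hidden, num_classes))

    def encode(self, tokens):
        h, _ = self.encoder(self.emb_drop(self.embed(tokens)))
        return self.rnn_drop(h.max(dim=1).values)  # max-pool over time

    def forward(self, premise, hypothesis):
        p, h = self.encode(premise), self.encode(hypothesis)
        # Standard NLI matching features: concat, difference, product.
        pair = torch.cat([p, h, (p - h).abs(), p * h], dim=-1)
        return self.classifier(pair)
```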

    In-Materio Extreme Learning Machines

    Nanomaterial networks have been presented as a building block for unconventional in-Materio processors. Evolution in-Materio (EiM) has previously presented a way to configure and exploit physical materials for computation, but their ability to scale as datasets get larger and more complex remains unclear. Extreme Learning Machines (ELMs) seek to exploit a randomly initialised single-layer feed-forward neural network by training the output layer only. An analogy for a physical ELM is produced by exploiting nanomaterial networks as material neurons within the hidden layer. Circuit simulations are used to efficiently investigate diode-resistor networks which act as our material neurons. These in-Materio ELMs (iM-ELMs) outperform common classification methods and traditional artificial ELMs of a similar hidden layer size. For iM-ELMs using the same number of hidden layer neurons, leveraging larger, more complex material neuron topologies (with more nodes/electrodes) leads to better performance, showing that these larger materials have a better capability to process data. Finally, iM-ELMs using virtual material neurons, where a single material is re-used as several virtual neurons, were found to achieve comparable results to iM-ELMs which exploited several different materials. However, while these Virtual iM-ELMs provide significant flexibility, they sacrifice the highly parallelised nature of physically implemented iM-ELMs.
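    For reference, a standard ELM in a few lines: a fixed random hidden layer followed by a least-squares-trained output layer. In the paper the random projection is realised physically by diode-resistor networks; a random tanh projection stands in for the material neuron here.

```python
# Classic Extreme Learning Machine: only the output layer is trained.
import numpy as np

rng = np.random.default_rng(0)

def fit_elm(X, Y, n_hidden=200, reg=1e-3):
    """X: (n_samples, n_features); Y: (n_samples, n_outputs), e.g. one-hot."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # fixed, never trained
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    # Ridge-regularised least squares for the output weights only.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W, b, beta

def predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```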

    PetBERT: automated ICD-11 syndromic disease coding for outbreak detection in first opinion veterinary electronic health records

    Effective public health surveillance requires consistent monitoring of disease signals such that researchers and decision-makers can react dynamically to changes in disease occurrence. However, whilst surveillance initiatives exist in production animal veterinary medicine, comparable frameworks for companion animals are lacking. First-opinion veterinary electronic health records (EHRs) have the potential to reveal disease signals and often represent the initial reporting of clinical syndromes in animals presenting for medical attention, highlighting their possible significance in early disease detection. Yet despite their availability, their free-text nature imposes limitations that inhibit the production of national-level mortality and morbidity statistics. This paper presents PetBERT, a large language model trained on over 500 million words from 5.1 million EHRs across the UK. PetBERT-ICD is the additional training of PetBERT as a multi-label classifier for the automated coding of veterinary clinical EHRs with the International Classification of Disease 11 framework, achieving F1 scores exceeding 83% across 20 disease codings with minimal annotations. PetBERT-ICD effectively identifies disease outbreaks, doing so up to 3 weeks earlier than current clinician-assigned point-of-care labelling strategies. The potential for PetBERT-ICD to enhance disease surveillance in veterinary medicine represents a promising avenue for advancing animal health and improving public health outcomes.
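    A hedged sketch of the multi-label coding step described: a BERT-style encoder with a sigmoid head over 20 ICD-11 syndromic codes. The base checkpoint below is a stand-in; the real PetBERT weights come from domain pre-training on the 5.1 million veterinary EHRs.

```python
# Multi-label ICD-11 coding of clinical free text (Hugging Face).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CODES = 20
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",                       # stand-in for PetBERT weights
    num_labels=NUM_CODES,
    problem_type="multi_label_classification")  # BCE loss, sigmoid outputs

def code_record(clinical_note, threshold=0.5):
    """Return the indices of all ICD-11 codes scoring above threshold."""
    inputs = tokenizer(clinical_note, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.sigmoid().squeeze(0)
    return (probs >= threshold).nonzero(as_tuple=True)[0].tolist()
```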

    Length is a Curse and a Blessing for Document-level Semantics

    In recent years, contrastive learning (CL) has been extensively utilized to recover sentence- and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability towards length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that we can devise unsupervised CL methods solely depending on the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document would intensify the high intra-document similarity that is already brought by CL. Moreover, we find that the isotropy promised by CL is highly dependent on the length range of text exposed in training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, LA(SER)^3: length-agnostic self-reference for semantically robust sentence representation learning, achieving state-of-the-art unsupervised performance on the standard information retrieval benchmark. Comment: Accepted at EMNLP 2023. Our code is publicly available at https://github.com/gowitheflow-1998/LA-SER-cube
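    A simple probe for the effect described, assuming encode is any CL-trained document encoder returning a pooled embedding tensor: compare a document's embedding with that of an artificially elongated copy. Repeating the text is a naive stand-in for the paper's length attacks, used here only for illustration.

```python
# Probe the length-induced semantic shift of a document encoder.
import torch.nn.functional as F

def length_attack_shift(encode, doc):
    """Cosine similarity between a document and an elongated copy of
    itself; values far below 1.0 indicate length vulnerability."""
    original = encode(doc)               # (dim,) pooled embedding
    elongated = encode(doc + " " + doc)  # naive elongation: repeat text
    return F.cosine_similarity(original, elongated, dim=0).item()
```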